Auditing Bias

Classification
Ethics
Auditing Bias in Automated Decision-Making Systems
Author

Lukka Wolff

Published

March 12, 2025

Abstract

This project audits bias in automated decision-making systems by analyzing employment predictions from the 2018 American Community Survey (ACS) data for Georgia. A Random Forest Classifier was trained to predict employment status based on demographic features such as age, education, sex, disability, and nativity, while racial bias was examined specifically between White and Black/African American individuals. The audit revealed approximately balanced accuracy, positive predictive values, and error rates across these racial groups, though slight discrepancies exist. Despite this numerical fairness, ethical questions remain around consent, data recency, and the responsible deployment of such models in different decision-making contexts.

Data and Feature Selection

We are using the folktables package to access data from the 2018 American Community Survey's Public Use Microdata Sample (PUMS) for the state of Georgia.

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "GA"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000025 5 1 3700 3 13 1013097 68 51 ... 124 69 65 63 117 66 14 68 114 121
1 P 2018GQ0000035 5 1 1900 3 13 1013097 69 56 ... 69 69 7 5 119 74 78 72 127 6
2 P 2018GQ0000043 5 1 4000 3 13 1013097 89 23 ... 166 88 13 13 15 91 163 13 89 98
3 P 2018GQ0000061 5 1 500 3 13 1013097 10 43 ... 19 20 3 9 20 3 3 10 10 10
4 P 2018GQ0000076 5 1 4300 3 13 1013097 11 20 ... 13 2 14 2 1 2 2 13 14 12

5 rows × 286 columns

This data set contains a large number of features for each individual, so we are going to narrow it down to only those that we may use to train our model.

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 51 13.0 5 16 2 NaN 1 3.0 4.0 1 1 2 2 2.0 1 2 6.0
1 56 16.0 3 16 1 NaN 1 1.0 4.0 4 1 2 1 2.0 2 1 6.0
2 23 20.0 5 17 1 NaN 1 1.0 4.0 4 1 2 2 1.0 2 2 1.0
3 43 17.0 1 16 2 NaN 1 1.0 4.0 1 1 2 2 2.0 1 2 6.0
4 20 19.0 5 16 2 NaN 1 1.0 4.0 1 1 2 2 2.0 2 1 6.0

A few key features to note are:

  • ESR is the employment status recode; a value of 1 indicates an employed individual, and the target transform below converts it to a binary label
  • RAC1P is race (1 for White Alone, 2 for Black/African American Alone, 3 and above for other self-identified racial groups)
  • SEX is binary sex (1 for male, 2 for female)
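Before committing to these features, we can sanity-check the raw codings directly; the snippet below is a quick sketch that reuses the acs_data frame loaded above.

# Raw value counts for the target, group, and sex columns.
for col in ["ESR", "RAC1P", "SEX"]:
    print(acs_data[col].value_counts(dropna=False).sort_index(), "\n")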

Now we select the features we want to use and construct a BasicProblem that expresses our desire to use these features to predict employment status ESR, using RAC1P as the group label.

features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

We now have a feature matrix features, a label vector label, and a group label vector group.

for obj in [features, label, group]:
  print(obj.shape)
(100855, 15)
(100855,)
(100855,)

We are now going to split our data into a training set and a testing set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Data Exploration

Before we dive straight into model training, let's take a deeper look at our data.

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train

Here is a quick look at our data frame containing our training data:

df.head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX group label
0 48.0 16.0 1.0 1.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 1.0 2 True
1 52.0 24.0 2.0 0.0 2.0 0.0 1.0 3.0 4.0 1.0 1.0 2.0 2.0 2.0 1.0 1 True
2 55.0 18.0 5.0 0.0 2.0 0.0 1.0 1.0 4.0 3.0 1.0 2.0 2.0 2.0 1.0 2 True
3 15.0 12.0 5.0 3.0 2.0 1.0 4.0 3.0 0.0 1.0 2.0 2.0 2.0 2.0 2.0 6 False
4 26.0 22.0 1.0 0.0 2.0 0.0 5.0 1.0 4.0 1.0 2.0 2.0 2.0 2.0 2.0 2 False
len(df)
80684

Our training data contains information on \(80684\) individuals in the state of Georgia.

df["label"].value_counts()
label
False    44664
True     36020
Name: count, dtype: int64
df['label'].mean()
np.float64(0.44643299786822666)

Of these individuals, 44.64% (\(36020\)) are employed.

df['group'].value_counts()
group
1    53302
2    20239
6     3267
9     1996
8     1589
3      159
5       66
7       66
Name: count, dtype: int64
df['group'].value_counts(normalize=True)
group
1    0.660627
2    0.250843
6    0.040491
9    0.024738
8    0.019694
3    0.001971
5    0.000818
7    0.000818
Name: proportion, dtype: float64

The two largest racial groups are 1 White Alone with 53302 individuals making up 66% of the data, and 2 Black/African American Alone with 20239 individuals making up 25% of the data.

df.groupby('group')['label'].mean()
group
1    0.460771
2    0.416127
3    0.433962
5    0.348485
6    0.481175
7    0.484848
8    0.429830
9    0.330160
Name: label, dtype: float64

~46% of White Alone individuals are employed and ~42% of Black/African American Alone individuals are employed.

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
def plot_intersection(df, col1='group', col2='SEX'):
    # Mean employment rate for each (col1, col2) combination
    prop_df = df.groupby([col1, col2])['label'].mean().reset_index()
    plt.figure(figsize=(8, 5))

    ax = sns.barplot(x=col1, y='label', hue=col2, data=prop_df)

    plt.title(f'Employment Proportion by {col1} and {col2}')
    plt.xlabel(col1)
    plt.ylabel('Employment Proportion')

    # Annotate each bar with its employment rate as a percentage
    for p in ax.patches:
        ax.annotate(f'{p.get_height()*100:.2f}', 
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha = 'center', va = 'bottom',
                   xytext = (0, 5), textcoords = 'offset points')
    
    plt.tight_layout()
    plt.show()

group_sex = plot_intersection(df, 'group', 'SEX')
    

For many groups, the percentage of men who are employed is higher than that of women. One notable exception is group 2 Black/African American Alone, where the percentage of employed women (~44%) exceeds that of men (~39%).

NATIVITY indicates a person's place of birth: 1 for native born and 2 for foreign born.

group_nativity = plot_intersection(df, 'group', 'NATIVITY')

Interestingly, across the board we see that the percentage of foreign-born individuals who are employed is much higher than that of native-born individuals. However, as seen in the plot below, this may be attributed to the fact that few foreign-born individuals are at the extremes of the age distribution, reducing the influence of youth and old age as factors in the employment proportion.

sns.displot(data=df, x="AGEP", hue="NATIVITY", kind="kde", bw_adjust=0.5, fill=True, alpha=0.75)

DIS represents an individual's disability status: 1 for with a disability and 2 for without a disability.

group_dis = plot_intersection(df, 'group', 'DIS')

Across the board we see that people without a disability are employed at a much higher rate than people with disabilities.


Below are some supplementary plots that I thought were interesting:
cit_sex = plot_intersection(df, 'CIT', 'SEX')

mar_sex = plot_intersection(df, 'MAR', 'SEX')

schl_sex = plot_intersection(df, 'SCHL', 'SEX')

Model Training

We are now ready to create a model and train it on our training data. We will first scale our data, then employ a Random Forest Classifier. This approach fits an ensemble of decision trees on sub-samples of the data and aggregates their predictions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

acc = 0
best_depth = 0
# Try a range of maximum tree depths and keep the one with the best
# 5-fold cross-validated accuracy on the training data.
# (cross_val_score refits the pipeline on each fold, so no separate fit is needed.)
for depth in range(5, 20):
    model = make_pipeline(StandardScaler(), RandomForestClassifier(max_depth=depth))
    cv_scores = cross_val_score(model, X_train, y_train, cv = 5)
    if cv_scores.mean() > acc:
        best_depth = depth
        acc = cv_scores.mean()
print(f"Best maximum tree depth: {best_depth}, Accuracy: {acc*100:.2f}%")
Best maximum tree depth: 16, Accuracy: 83.39%

Above, we tuned our model complexity using the max_depth parameter of the RandomForestClassifier. This controls how deep each tree in the forest can grow, which affects how prone the model is to overfitting. We examined max depths from 5 to 19 and found the highest cross-validated accuracy at max_depth=16.
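As an aside, the same search can be expressed more compactly with scikit-learn's GridSearchCV; the sketch below is an equivalent alternative to the loop above, not what was actually run.

from sklearn.model_selection import GridSearchCV

# Equivalent 5-fold cross-validated search over max_depth.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
param_grid = {"randomforestclassifier__max_depth": list(range(5, 20))}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)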

RF = make_pipeline(StandardScaler(), RandomForestClassifier(max_depth=best_depth))
RF.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=16))])

Model Audit

Overall Measures

y_hat = RF.predict(X_test)
(y_hat == y_test).mean()
np.float64(0.8281195776114223)

Our model has an overall accuracy of 82.81% on the testing data. Not too shabby!

We will now address the positive predictive value (PPV) of our model (we won't evaluate full sufficiency right now, since we are skipping NPV). Given that the prediction is positive (y_hat == 1), how likely is it that the prediction is accurate (y_test == 1)? In other words, if we predict someone to be employed, how likely is it that they are actually employed?

We can approximate this value with the following code:

tp = (y_hat == 1) & (y_test == 1)
fp = (y_hat == 1) & (y_test == 0)
ppv = tp.sum() / (tp.sum() + fp.sum())
print(f"Positive Predictive Value: {ppv*100:.2f}%")
Positive Predictive Value: 77.46%

So, when our model predicts someone is employed, they are actually employed 77.46% of the time.

from sklearn.metrics import ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_hat)
ConfusionMatrixDisplay(cm, display_labels=['Not Employed', 'Employed']).plot(cmap='Blues')
plt.title('Confusion Matrix')
Text(0.5, 1.0, 'Confusion Matrix')

Above is our confusion matrix. Let's use this information to find the false negative and false positive rates for our model.

fn = (y_hat == 0) & (y_test == 1)
fnr = fn.sum() / (tp.sum() + fn.sum())
print(f"False Negative Rate: {fnr*100:.2f}%")
False Negative Rate: 12.74%

Our model incorrectly classified people as unemployed when they were in fact employed 12.74% of the time.

fp = (y_hat == 1) & (y_test == 0)
tn = (y_hat == 0) & (y_test == 0)
fpr = fp.sum() / (fp.sum() + tn.sum())
print(f"False Positive Rate: {fpr*100:.2f}%")
False Positive Rate: 20.84%

Our model incorrectly classified people as employed when they were in fact unemployed 20.84% of the time.

By-Group Measures

Now, let's explore how our model treats people in their respective groups. We are going to focus primarily on possible discrepancies between individuals in the 1 White Alone and 2 Black/African American Alone groups.

wa = (y_hat == y_test)[group_test == 1].mean()
aa = (y_hat == y_test)[group_test == 2].mean()
print(f"White Alone Accuracy: {wa*100:.2f}%\nBlack/African American Alone Accuracy: {aa*100:.2f}%")
White Alone Accuracy: 82.40%
Black/African American Alone Accuracy: 83.17%

Our model has pretty comparable accuracy scores for both groups, only slightly lower for White Alone individuals.

tp_wa = ((y_hat == 1) & (y_test == 1) & (group_test == 1))
fp_wa = ((y_hat == 1) & (y_test == 0) & (group_test == 1))
ppv_wa = tp_wa.sum() / (tp_wa.sum() + fp_wa.sum())

tp_aa = ((y_hat == 1) & (y_test == 1) & (group_test == 2))
fp_aa = ((y_hat == 1) & (y_test == 0) & (group_test == 2))
ppv_aa = tp_aa.sum() / (tp_aa.sum() + fp_aa.sum())

print(f"PPV for White Alone: {ppv_wa*100:.2f}%")
print(f"PPV for Black/African American Alone: {ppv_aa*100:.2f}%")
PPV for White Alone: 77.96%
PPV for Black/African American Alone: 75.84%

Once again our PPV rates are quite comparable, although the White Alone group does have a slightly higher PPV.

fn_wa = (y_hat == 0) & (y_test == 1) & (group_test == 1)
fnr_wa = fn_wa.sum() / (tp_wa.sum() + fn_wa.sum())

fn_aa = (y_hat == 0) & (y_test == 1) & (group_test == 2)
fnr_aa = fn_aa.sum() / (tp_aa.sum() + fn_aa.sum())

print(f"FNR for White Alone: {fnr*100:.2f}%")
print(f"FNR for Black/African American Alone: {fnr_aa*100:.2f}%")
FNR for White Alone: 12.74%
FNR for Black/African American Alone: 12.86%

The false negative rates are once again very similar across groups.

fp_wa = ((y_hat == 1) & (y_test == 0) & (group_test == 1))
tn_wa = ((y_hat == 0) & (y_test == 0) & (group_test == 1))
fpr_wa = fp_wa.sum() / (fp_wa.sum() + tn_wa.sum())

fp_aa = ((y_hat == 1) & (y_test == 0) & (group_test == 2))
tn_aa = ((y_hat == 0) & (y_test == 0) & (group_test == 2))
fpr_aa = fp_aa.sum() / (fp_aa.sum() + tn_aa.sum())

print(f"FPR for White Alone: {fpr_wa*100:.2f}%")
print(f"FPR for Black/African American Alone: {fpr_aa*100:.2f}%")
FPR for White Alone: 21.33%
FPR for Black/African American Alone: 19.64%

There is some slight discrepancy here, as persons in the White Alone group are more often mistakenly classified as employed than persons in the Black/African American Alone group.

Bias Measures

In terms of accuracy, our model seems to be performing well. Let’s take a deeper look at how our model might be biased or unfair by examining calibration, error rate balance, and statistical parity.

Our model can be considered well-calibrated, or sufficient, if it reflects equal likelihood of employment irrespective of an individual's group membership. That is, if the model is free from predictive bias, the PPV should be the same for both groups. Looking back at our calculations, we saw that they were about equal: a PPV of 77.96% for White Alone and 75.84% for Black/African American Alone. Thus, we can say our model is approximately well-calibrated.
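As a compact check, we can print the gap between the two group PPVs directly, reusing the values computed above:

# Absolute gap between group PPVs; a small gap supports approximate calibration.
ppv_gap = abs(ppv_wa - ppv_aa)
print(f"PPV gap between groups: {ppv_gap*100:.2f} percentage points")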


Our model satisfies approximate error rate balance only if the true positive rates (TPR) and false positive rates (FPR) are approximately equal across the two groups.

tpr_wa = tp_wa.sum() / (tp_wa.sum() + fn_wa.sum())
tpr_aa = tp_aa.sum() / (tp_aa.sum() + fn_aa.sum())
print(f"TPR --> WA: {tpr_wa*100:.2f}% ~~ AA: {tpr_aa*100:.2f}%")
print(f"FPR --> WA: {fpr_wa*100:.2f}% ~~ AA: {fpr_aa*100:.2f}%")
TPR --> WA: 86.69% ~~ AA: 87.14%
FPR --> WA: 21.33% ~~ AA: 19.64%

We can see that both groups have approximately equal TPRs and FPRs. Thus our model satisfies approximate error rate balance.


Our model satisfies statistical parity if the proportion of individuals classified as employed is the same for each group.

prop_wa = (y_hat == 1)[group_test == 1].mean()
prop_aa = (y_hat == 1)[group_test == 2].mean()
print(f"Proportion of White Alone Predicted to be Employed: {prop_wa*100:.2f}%")
print(f"Proportion of Black/African American Alone Predicted to be Employed: {prop_aa*100:.2f}%")
Proportion of White Alone Predicted to be Employed: 51.74%
Proportion of Black/African American Alone Predicted to be Employed: 47.61%

We can observe some difference in these two scores: a higher proportion of White Alone persons are predicted to be employed than Black/African American Alone persons. The difference remains fairly small, as the rates are within 5 percentage points of one another. However, since we have not set a threshold for how large a gap is acceptable, we cannot say definitively whether statistical parity is or is not satisfied.
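Separately from choosing a fairness threshold, we could at least check whether a gap of this size is plausibly just sampling noise. The sketch below uses a two-proportion z-test from statsmodels (assumed to be installed) on the predicted-positive counts; it addresses only the statistical question, not the fairness one.

from statsmodels.stats.proportion import proportions_ztest

# Predicted-employed counts and group sizes in the test set.
counts = np.array([((y_hat == 1) & (group_test == 1)).sum(),
                   ((y_hat == 1) & (group_test == 2)).sum()])
nobs = np.array([(group_test == 1).sum(), (group_test == 2).sum()])

stat, pval = proportions_ztest(counts, nobs)
print(f"z = {stat:.2f}, p = {pval:.4f}")

A small p-value would indicate that the gap is unlikely to be due to chance alone, but it would not tell us whether the gap is large enough to be unfair; that still requires a substantive threshold.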


Feasible FNR and FPRs

# Proportion of each group's test observations that are both actually employed and predicted employed.
p_wa = ((y_test == 1) & (y_hat == 1))[group_test == 1].mean()
p_aa = ((y_test == 1) & (y_hat == 1))[group_test == 2].mean()
print(f"Prevalence of White Alone: {p_wa*100:.2f}%")
print(f"Prevalence of Black/African American Alone: {p_aa*100:.2f}%")
Prevalence of White Alone: 40.34%
Prevalence of Black/African American Alone: 36.11%

The proportion of true positives is higher in the White Alone group.

import numpy as np

plt.figure(figsize=(8, 5))


fnr_range = np.linspace(0, 1, 100)
fpr_aa_alt = (p_aa / (1 - p_aa)) * ((1 - ppv_aa) / ppv_aa) * (1 - fnr_aa)
fpr_wa_alt = (p_wa / (1 - p_wa)) * ((1 - ppv_wa) / ppv_wa) * (1 - fnr_wa)

# Calibrate on low PPV --> PPv from Black/African American Alone
fpr_aa_feasible = (p_aa / (1 - p_aa)) * ((1 - ppv_aa) / ppv_aa) * (1 - fnr_range) # PPVb is set equal to the observed value of PPVw
fpr_wa_feasible = (p_wa / (1 - p_wa)) * ((1 - ppv_aa) / ppv_aa) * (1 - fnr_range) # pw and PPVw are both held fixed

plt.plot(fnr_wa, fpr_wa_alt, 'o', color='orange', label='White Alone')
plt.plot(fnr_aa, fpr_aa_alt, 'o', color='black', label='Black/African American Alone')

plt.plot(fnr_range, fpr_aa_feasible, '-', color='black')
plt.plot(fnr_range, fpr_wa_feasible, '-', color='orange')


# Shading Attempts
# delta = np.abs(ppv_wa - ppv_aa)
# low = (p_wa / (1 - p_wa)) * ((1 - ppv_wa - (delta < 0.05)) / ppv_wa - (delta < 0.05)) * (1 - fnr_range)
# high = (p_wa / (1 - p_wa)) * ((1 - ppv_wa + (delta < 0.05)) / ppv_wa + (delta < 0.05)) * (1 - fnr_range)
# plt.fill_between(fnr_range, low, high, color='blue', alpha=0.3)

plt.xlabel('False Negative Rate')
plt.ylabel('False Positive Rate')
plt.title('Feasible (FNR, FPR) combinations')
plt.legend()

Our current model appears to be working quite well for both groups, as their (FNR, FPR) points are not far separated. However, if we did want to equalize the false positive rates (how often someone is classified as employed when they are not), this would necessitate increasing the false negative rate for White Alone from its current value of roughly 0.12 to roughly 0.22, which would inevitably lead to a drop in accuracy. In context, this would mean classifying more White Alone persons who are actually employed as unemployed in order to match the FPRs.
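One way to explore this trade-off concretely is to vary a per-group decision threshold on the model's predicted probabilities and watch how the two error rates move. The sketch below reuses the fitted RF pipeline; the threshold grid and the 0.5 reference point are purely illustrative.

# Predicted probability of the positive (employed) class on the test set.
scores = RF.predict_proba(X_test)[:, 1]

def group_rates(threshold, g):
    # FNR and FPR for group g when we predict "employed" if score >= threshold.
    mask = (group_test == g)
    pred = scores[mask] >= threshold
    actual = y_test[mask]
    fnr = ((~pred) & (actual == 1)).sum() / (actual == 1).sum()
    fpr = (pred & (actual == 0)).sum() / (actual == 0).sum()
    return fnr, fpr

# Sweep thresholds for the White Alone group and compare its FPR to the
# Black/African American Alone FPR at the default 0.5 threshold.
target_fpr = group_rates(0.5, 2)[1]
for t in np.linspace(0.4, 0.7, 7):
    fnr_t, fpr_t = group_rates(t, 1)
    print(f"threshold={t:.2f}  FNR={fnr_t:.3f}  FPR={fpr_t:.3f}  (target FPR={target_fpr:.3f})")

Raising the White Alone threshold lowers that group's FPR toward the target at the cost of a higher FNR, which is exactly the trade-off described above.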

Nonetheless, our model performs close to even for both groups.

Conclusion

There are many applications for a model of this sort. Predicting who is or isn't employed would benefit many companies that operate on credit or lending. Knowing who is employed helps determine whether it would be wise to approve a higher credit limit, approve a mortgage application, or lease someone a car, because employment is widely used as a feature or indicator of someone's ability to pay back their debts and the interest on them. The better a company can determine this, the more profitable it can be. These predictions could also be useful to marketing and advertising companies for improving ad targeting.

Deploying this specific Random Forest based model at large scale could magnify the small discrepancies in error. For instance, suppose the model were used by the US government to determine who needs unemployment aid. The small difference in FPRs (WA: ~21% vs. AA: ~20%) could mean that large numbers of people are classified as employed and do not receive the help they need, and the burden would fall slightly disproportionately on White Alone individuals relative to Black/African American Alone individuals. Conversely, this FPR could work to people's advantage in a commercial setting, for example by yielding lower interest rates on loans or better odds of landing a job.

Based on my bias audit, most of my evaluation metrics were quite close across groups. None of them differed enough for me to view the model as problematic with regard to the White Alone or Black/African American Alone groups. Of course, the model was also trained on data from other groups, so further exploration of those groups would be needed to determine whether problematic levels of bias exist there.

Bias aside, there could be other potential issues in deploying this algorithm. First and foremost, I believe that, with the possible exception of advertisement targeting, employment status should be requested and declared with the consent of the individual, and there should be government regulation of which industries are allowed to use such algorithms and how. Where the model is used, I also worry about the transparency of its processes: people should know what their data is being used for as it is collected and where the data came from when it is used, and the model's decisions should be transparent and easily interpretable in use cases like credit approvals. Finally, data from Georgia in 2018 may not generalize across the country, or even to present-day Georgia, as much has changed since 2018 and after COVID; an effort should be made to keep the training data reflective of current trends. Thus, even though the model appears acceptable with regard to numerical fairness metrics, there are broader ethical and practical concerns that must be addressed in order to deploy a responsible and effective model.